Abstract

The accurate reconstruction of phylogenies from short molecular sequences
is an important problem in computational biology. Recent work has
highlighted deep connections between sequence-length requirements for
high-probability phylogeny reconstruction and the related problem of the
estimation of ancestral sequences. In [Daskalakis et al.'09], building on the
work of [Mossel'04], a tight sequence-length requirement was obtained for the
simple CFN model of substitution, that is, the case of a two-state symmetric
rate matrix $Q$. In particular the required sequence length for
high-probability reconstruction was shown to undergo a sharp transition (from
$O(\log n)$ to $\mathrm{poly}(n)$, where $n$ is the number of leaves) at the
"critical" branch length $g_{\mathrm{ML}}(Q)$ (if it exists) of the ancestral
reconstruction problem defined roughly as follows: below
$g_{\mathrm{ML}}(Q)$ the sequence at the root can be accurately estimated
from sequences at the leaves on deep trees, whereas above
$g_{\mathrm{ML}}(Q)$ information decays exponentially quickly down the tree.

Here we consider a more general evolutionary model, the GTR model, where the
$q \times q$ rate matrix $Q$ is reversible with $q \ge 2$. For this model,
recent results of [Roch'09] show that the tree can be accurately reconstructed
with sequences of length $O(\log(n))$ when the branch lengths are below
$g_{\mathrm{Lin}}(Q)$, known as the Kesten-Stigum (KS) bound, up to which
ancestral sequences can be accurately estimated using simple linear
estimators. Although for the CFN model
$g_{\mathrm{ML}}(Q) = g_{\mathrm{Lin}}(Q)$ (in other words, linear ancestral
estimators are in some sense best possible), it is known that for the more
general GTR models one has $g_{\mathrm{ML}}(Q) \ge g_{\mathrm{Lin}}(Q)$ with a
strict inequality in many cases. Here, we show that this phenomenon also holds
for phylogenetic reconstruction by exhibiting a family of symmetric models $Q$
and a phylogenetic reconstruction algorithm which recovers the tree from
$O(\log n)$-length sequences for some branch lengths in the range
$(g_{\mathrm{Lin}}(Q), g_{\mathrm{ML}}(Q))$. Second, we prove that
phylogenetic reconstruction under GTR models requires a polynomial
sequence length for branch lengths above $g_{\mathrm{ML}}(Q)$.
1 Introduction



Background. Recent years have witnessed a convergence of models and prob- 
lems from evolutionary biology, statistical physics, and computer science. Stan- 
dard stochastic models of molecular evolution, such as the Cavender-Farris-Neyman 
(CFN) model (a.k.a. the Ising model or Binary Symmetric Channel (BSC)) or the 
Jukes-Cantor (JC) model (a.k.a. the Potts model), have been extensively studied 
from all these different perspectives and fruitful insights have emerged, notably in 
the area of computational phylogenetics. 

Phylogenetics [SS03, Fel04] is centered around the reconstruction of
evolutionary histories from molecular data extracted from modern species. The assumption
is that molecular data consists of aligned sequences and that each position in the 
sequences evolves independently according to a Markov model on a tree, where 
the key parameters are (see Section 2 for formal definitions): 

• Rate matrix. A q x q mutation rate matrix Q, where q is the alphabet size. 
A typical alphabet is the set of nucleotides {A, C, G, T}, but here we allow 
more general state spaces. Without loss of generality, we denote the alphabet 
by [q] = {1, . . . , q}. The (i, j)'th entry of Q encodes the rate at which state 
i mutates into state j. 

• Tree. An evolutionary tree $T$, where the leaves are the modern species and
each branching represents a past speciation event. We denote the leaves by
$[n] = \{1, \ldots, n\}$.

• Branch lengths. For each edge $e$, we have a scalar branch length $\tau(e)$
which measures the expected total number of substitutions per site along edge
$e$. Roughly speaking, $\tau(e)$ is the time duration between the end points
of $e$ multiplied by the mutation rate.

We consider the following two closely related problems: 

1. Phylogenetic Tree Reconstruction (PTR). Given $n$ molecular sequences of
length $k$ (one for each leaf)

$$\{(s_a^i)_{i=1}^k\}_{a \in [n]},$$

with $s_a^i \in [q]$, which have evolved according to the process above with
independent sites, reconstruct the topology of the evolutionary tree.

2. Ancestral State Reconstruction (ASR). Given a fully specified rooted tree
and a single state $s_a$ at each leaf $a$ of the tree, estimate (better than
"random") the state at the root of the tree, independently of the depth of
the tree.
In both cases, longer edge lengths correspond to more mutations (and hence
more noise), making both reconstruction problems more challenging. Our
overriding goal is to extend efficient phylogenetic reconstruction to trees
with branch lengths as large as possible.

Reconstruction thresholds. Alternatively, the second problem can be interpreted
in terms of correlation decay along the tree or as a broadcasting problem on a
tree network. It has thus been extensively studied in statistical physics,
probability theory, and computer science. See e.g. [EKPS00] and references
therein. A crucial parameter in the ASR problem is
$\tau^+(T) = \max_e \tau(e)$, the maximal branch length in the tree.

One class of ancestral estimators is particularly well understood, the
so-called linear estimators. See Section 2 for a formal definition. In essence,
linear estimators are simply a form of weighted majority. In [MP03], it was
shown that there exists a critical parameter
$g_{\mathrm{Lin}}(Q) = \lambda_Q^{-1} \ln \sqrt{2}$, where $-\lambda_Q$ is the
largest negative eigenvalue of the rate matrix $Q$, such that:

• if $\tau^+ < g_{\mathrm{Lin}}(Q)$, for all trees with $\tau^+(T) = \tau^+$ a
well-chosen linear estimator provides a good solution to the ASR,

• if $\tau^+ > g_{\mathrm{Lin}}(Q)$, there exist trees with
$\tau^+(T) = \tau^+$ for which ASR is impossible for any linear estimator,
that is, the correlation between the best linear root estimate and the true
root value decays exponentially in the depth of the tree.

For formal definitions, see [MP03]. The threshold
$g_{\mathrm{Lin}}(Q) = \lambda_Q^{-1} \ln \sqrt{2}$ is also known to be the
critical threshold for robust (ancestral) reconstruction, see [JM04]
for details.
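As a small numerical illustration of the threshold formula above, the following sketch evaluates $g_{\mathrm{Lin}}(Q) = \lambda_Q^{-1} \ln \sqrt{2}$ and checks it against the usual form of the Kesten-Stigum condition on a binary tree, $2 e^{-2 \lambda_Q \tau} = 1$. The function names `ks_threshold` and `ks_criterion` are hypothetical and introduced only for this example.

```python
import math


def ks_threshold(lam: float) -> float:
    """Return ln(sqrt(2)) / lam, where -lam is the largest negative
    eigenvalue of the rate matrix (lam > 0)."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return math.log(math.sqrt(2.0)) / lam


def ks_criterion(lam: float, tau: float) -> float:
    """Return 2 * exp(-2 * lam * tau).

    This quantity exceeds 1 below the KS bound and is below 1 above it."""
    return 2.0 * math.exp(-2.0 * lam * tau)


# With the eigenvalue normalized to lam = 1 the bound is ln(sqrt(2)).
g_lin = ks_threshold(1.0)
print(g_lin, ks_criterion(1.0, g_lin))
```

With $\lambda_Q = 1$ the threshold is $\ln \sqrt{2} \approx 0.347$; a larger $\lambda_Q$ (faster mixing) gives a proportionally smaller threshold.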

For more general ancestral estimators, only partial results are known. For the
two-state symmetric $Q$ (the CFN model), impossibility of reconstruction as
above holds, when $\tau^+(T) > g_{\mathrm{Lin}}(Q)$, not only for linear
estimators but also for any estimator, including for instance maximum
likelihood. In other words, for the CFN model linear estimators are in some
sense best possible. This phenomenon also holds for symmetric models (i.e.,
where all non-diagonal entries of $Q$ are identical) with $q = 3$ states
[Sly09] (at least, for high degree trees). However, for symmetric models on
$q \ge 5$ states, it is known that ASR is possible beyond
$g_{\mathrm{Lin}}(Q)$, up to a critical branch length
$g_{\mathrm{ML}}(Q) > g_{\mathrm{Lin}}(Q)$ which is not known explicitly
[Mos01, Sly09]. Larger values of $q$ here correspond for instance to models of
protein evolution. ASR beyond $g_{\mathrm{Lin}}(Q)$ can be achieved with a
maximum likelihood estimator although in some cases special estimators have
been devised (for instance, symmetric models with large $q$) [Mos01]. In this
context, $g_{\mathrm{Lin}}(Q)$ is referred to as the Kesten-Stigum bound
[KS67]. We sometimes call the condition $\tau^+(T) < g_{\mathrm{Lin}}(Q)$ the
"KS phase" and the condition $\tau^+(T) < g_{\mathrm{ML}}(Q)$ the
"reconstruction phase."

For general reversible rate matrices, it is not even known whether there is a
unique reconstruction threshold $g_{\mathrm{ML}}(Q)$ such that ASR is possible
for $\tau^+(T) < g_{\mathrm{ML}}(Q)$ and impossible for
$\tau^+(T) > g_{\mathrm{ML}}(Q)$. The general question of finding the
threshold $g_{\mathrm{ML}}(Q)$ for ASR is extremely challenging and has been
answered for only a very small number of channels.

Steel's Conjecture. A striking conjecture of Steel [Ste01] postulates a deep
connection between PTR and ASR. More specifically, the conjecture states that
for CFN models if $\tau^+(T) < g_{\mathrm{Lin}}(Q)$ then PTR can be achieved
with sequence length $k = O(\log n)$. This says that, when we can accurately
estimate the states of vertices deep inside a known tree, then it is also
possible to accurately reconstruct the topology of an unknown tree with very
short sequence lengths.

In fact, since the number of trees on $n$ labelled leaves is
$2^{\Theta(n \log n)}$, this is an optimal sequence length up to constant
factors; that is, we cannot hope to distinguish so many trees with fewer
potential datasets. The proof of Steel's conjecture was established in [Mos04]
for balanced trees and in [DMR09] for general trees (under the additional
assumption that branch lengths are discretized). Furthermore, results of
Mossel [Mos03, Mos04] show that for $\tau^+(T) > g_{\mathrm{Lin}}(Q)$ a
polynomial sequence length is needed for correct phylogenetic reconstruction.
For symmetric models, the results of [Mos04, DMR09] imply that it is possible
to reconstruct phylogenetic trees from sequences of length $O(\log n)$ when
$\tau^+(T) < g_{\mathrm{Lin}}(Q)$. These results cover classical models such
as the JC model ($q = 4$). Recent results of Roch [Roc09], building on
[Roc08, PR09], show that for any reversible mutation matrix $Q$, it is
possible to reconstruct phylogenetic trees from $O(\log(n))$-length
sequences again when $\tau^+(T) < g_{\mathrm{Lin}}(Q)$.
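The counting argument behind the optimality of $k = O(\log n)$ can be made concrete. The sketch below assumes the standard count $(2n-5)!!$ of leaf-labelled unrooted binary trees and compares it with the $q^{nk}$ possible datasets; the function names are hypothetical and the bound is only the crude counting bound, not a statement about any particular algorithm.

```python
import math


def num_unrooted_binary_trees(n: int) -> int:
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5): the number of leaf-labelled
    unrooted binary trees on n >= 3 leaves."""
    if n < 3:
        raise ValueError("need n >= 3")
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


def counting_lower_bound(n: int, q: int) -> float:
    """Smallest k (as a real number) with q**(n*k) >= number of trees.

    With fewer sites there are fewer possible datasets than trees, so no
    deterministic map from datasets to trees can hit every tree."""
    return math.log(num_unrooted_binary_trees(n)) / (n * math.log(q))


for n in (16, 256, 4096):
    print(n, round(counting_lower_bound(n, 4), 2))
```

The printed values grow roughly like $\log n$, in line with the $2^{\Theta(n \log n)}$ count quoted above.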

However, these results leave the following important problem open: 

• As we mentioned before, for symmetric models on $q \ge 5$ states, it is
known that ASR is possible for $\tau^+(T) < g_{\mathrm{ML}}(Q)$ where
$g_{\mathrm{ML}}(Q) > g_{\mathrm{Lin}}(Q)$. A natural question is to ask if
the "threshold" for PTR is $g_{\mathrm{ML}}(Q)$ (i.e., the threshold for ASR)
or $g_{\mathrm{Lin}}(Q)$ or perhaps another value. (Note that for the CFN
model, the threshold for PTR has been shown to be $g_{\mathrm{Lin}}(Q)$ but in
that case it so happens that $g_{\mathrm{Lin}}(Q) = g_{\mathrm{ML}}(Q)$.)

Our contributions. Our main results are the following: 

• We show that for symmetric models $Q$ with large $q$, it is possible to
reconstruct phylogenetic trees with $O(\log n)$-length sequences whenever
$\tau^+(T) < g_q^+$ where $g_{\mathrm{Lin}}(Q) < g_q^+ < g_{\mathrm{ML}}(Q)$.
We thus show that PTR from logarithmic sequences is sometimes possible for
branch lengths above the KS bound.

• We also show how to generalize the arguments of [Mos03, Mos04] to show
that for any $Q$ and $\tau^+(T) > g_{\mathrm{ML}}(Q)$ it holds that correct
phylogenetic reconstruction requires polynomial-length sequences in general.
The same idea is used in [Mos03, Mos04] and in the argument presented here.
The main difference is that the arguments in [Mos03, Mos04] used mutual
information together with coupling, while the more elegant argument presented
here uses coupling only. The results of [Mos03] apply for general models but
are not tight even for the CFN model. The argument in [Mos04] gives tight
results for the CFN model. It is possible to extend that argument to more
general models, but we prefer the simpler proof given in the current paper.

Organization. We begin with preliminaries and the formal statements of our re- 
sults in Section 2. The proof of our upper bound can be found in Section 3. The 
proof of our lower bound can be found in Section 4. 

2 Definitions and Results 
2.1 Basic Definitions 

Phylogenies. We define phylogenies and evolutionary distances more formally. 

Definition 1 (Phylogeny) A phylogeny is a rooted, edge-weighted, leaf-labeled
tree $\mathcal{T} = (V, E, [n], \rho; \tau)$ where: $V$ is the set of
vertices; $E$ is the set of edges; $L = [n] = \{1, \ldots, n\}$ is the set of
leaves; $\rho$ is the root; $\tau : E \to (0, +\infty)$ is a positive edge
weight function. We further assume that all internal nodes in $\mathcal{T}$
have degree 3 except for the root $\rho$ which has degree 2. We let
$\mathbb{Y}_n$ be the set of all such phylogenies on $n$ leaves and we denote
$\mathbb{Y} = \{\mathbb{Y}_n\}_{n \ge 1}$.

Definition 2 (Tree Metric) For two leaves $a, b \in [n]$, we denote by
$\mathrm{Path}(a, b)$ the set of edges on the unique path between $a$ and $b$.
A tree metric on a set $[n]$ is a positive function
$d : [n] \times [n] \to (0, +\infty)$ such that there exists a tree
$T = (V, E)$ with leaf set $[n]$ and an edge weight function
$w : E \to (0, +\infty)$ satisfying the following: for all leaves
$a, b \in [n]$

$$d(a, b) = \sum_{e \in \mathrm{Path}(a, b)} w_e.$$

For convenience, we denote by $(\tau(a, b))_{a, b \in [n]}$ the tree metric
corresponding to the phylogeny $\mathcal{T} = (V, E, [n], \rho; \tau)$. We
extend $\tau(u, v)$ to all vertices $u, v \in V$ in the obvious way.

Example 1 (Homogeneous Tree) For an integer $h \ge 0$, we denote by
$\mathcal{T}^{(h)} = (V^{(h)}, E^{(h)}, L^{(h)}, \rho^{(h)}; \tau)$ a rooted
phylogeny where $T^{(h)}$ is the $h$-level complete binary tree with arbitrary
edge weight function $\tau$ and $L^{(h)} = [2^h]$. For $0 \le h' \le h$, we
let $L^{(h)}_{h'}$ be the vertices on level $h - h'$ (from the root). In
particular, $L^{(h)}_0 = L^{(h)}$ and $L^{(h)}_h = \{\rho^{(h)}\}$. We let
$\mathbb{HY} = \{\mathbb{HY}_n\}_{n \ge 1}$ be the set of all phylogenies with
homogeneous underlying trees.

Model of molecular sequence evolution. Phylogenies are reconstructed from 
molecular sequences extracted from the observed species. The standard model 
of evolution for such sequences is a Markov model on a tree (MMT). 

Definition 3 (Markov Model on a Tree) Let $q \ge 2$. Let $n \ge 1$ and let
$T = (V, E, [n], \rho)$ be a rooted tree with leaves labeled in $[n]$. For
each edge $e \in E$, we are given a $q \times q$ stochastic matrix
$M^e = (M^e_{ij})_{i,j \in [q]}$, with fixed stationary distribution
$\pi = (\pi_i)_{i \in [q]}$. An MMT $(\{M^e\}_{e \in E}, T)$ associates a
state $s_v$ in $[q]$ to each vertex $v$ in $V$ as follows: pick a state for
the root $\rho$ according to $\pi$; moving away from the root, choose a state
for each vertex $v$ independently according to the distribution
$(M^e_{s_u, j})_{j \in [q]}$, with $e = (u, v)$ where $u$ is the parent of $v$.
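The sampling procedure in Definition 3 can be sketched in a few lines. The data layout below (a `children` dictionary, matrices keyed by `(parent, child)` pairs, states numbered from 0 rather than 1) and the function name `sample_mmt` are assumptions made for illustration only.

```python
import random


def sample_mmt(children, root, pi, M, rng):
    """Draw one state per vertex of a rooted tree.

    children: dict mapping a vertex to the list of its children
    pi:       root distribution over states 0..q-1
    M:        dict mapping (parent, child) to a q x q row-stochastic
              matrix given as a list of rows
    rng:      a random.Random instance
    """
    q = len(pi)
    states = {root: rng.choices(range(q), weights=pi)[0]}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            # Row s_u of the edge matrix is the law of s_v given s_u.
            row = M[(u, v)][states[u]]
            states[v] = rng.choices(range(q), weights=row)[0]
            stack.append(v)
    return states


# A 3-leaf rooted tree with the same noisy binary channel on every edge.
children = {"rho": ["u", 3], "u": [1, 2]}
flip = [[0.9, 0.1], [0.1, 0.9]]
M = {(u, v): flip for u, vs in children.items() for v in vs}
print(sample_mmt(children, "rho", [0.5, 0.5], M, random.Random(0)))
```

Calling `sample_mmt` $k$ times and keeping only the leaf entries gives $k$ independent sites, which is the input format assumed by the reconstruction problems above.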

The most common MMT used in phylogenetics is the so-called general time- 
reversible (GTR) model. 

Definition 4 (GTR Model) Let $[q]$ be a set of character states with
$q = |[q]|$ and $\pi$ be a distribution on $[q]$ satisfying $\pi_i > 0$ for
all $i \in [q]$. For $n \ge 1$, let $\mathcal{T} = (V, E, [n], \rho; \tau)$ be
a phylogeny. Let $Q$ be a $q \times q$ rate matrix, that is, $Q_{ij} > 0$ for
all $i \ne j$ and $\sum_{j \in [q]} Q_{ij} = 0$, for all $i \in [q]$. Assume
$Q$ is reversible with respect to $\pi$, that is,
$\pi_i Q_{ij} = \pi_j Q_{ji}$, for all $i, j \in [q]$. The GTR model on
$\mathcal{T}$ with rate matrix $Q$ is an MMT on $T = (V, E, [n], \rho)$ with
transition matrices $M^e = e^{\tau_e Q}$, for all $e \in E$. By the
reversibility assumption, $Q$ has $q$ real eigenvalues
$0 = \Lambda_1 > \Lambda_2 \ge \cdots \ge \Lambda_q$. We normalize $Q$ by
fixing $\Lambda_2 = -1$. We denote by $\mathbb{Q}_q$ the set of all such rate
matrices. We let $\mathbb{G}_{n,q} = \mathbb{Y}_n \otimes \mathbb{Q}_q$ be the
set of all $q$-state GTR models on $n$ leaves. We denote
$\mathbb{G}_q = \{\mathbb{G}_{n,q}\}_{n \ge 1}$. We denote by $s_W$ the vector
of states on the vertices $W \subseteq V$. In particular, $s_{[n]}$ are the
states at the leaves. We denote by $\mathcal{L}_{\mathcal{T},Q}$ the
distribution of $s_{[n]}$.

GTR models are often used in their full generality in the biology literature, but 
they also encompass several popular special cases such as the CFN model and the 
JC model. 
Example 2 ($q$-State Symmetric Model) The $q$-state Symmetric model (also
called $q$-state Potts model) is the GTR model with $q \ge 2$ states,
$\pi = (1/q, \ldots, 1/q)$, and $Q = Q^{(q)}$ where

$$Q^{(q)}_{ij} = \begin{cases} -\frac{q-1}{q} & \text{if } i = j, \\ \frac{1}{q} & \text{o.w.} \end{cases}$$

It is easy to check that $\Lambda_2(Q) = -1$. The special cases $q = 2$ and
$q = 4$ are called respectively the CFN and JC models in the biology
literature. We denote their rate matrices by $Q^{\mathrm{CFN}}$,
$Q^{\mathrm{JC}}$. For an edge $e$ of length $\tau_e > 0$, let

$$\delta_e = \frac{1}{q}\left(1 - e^{-\tau_e}\right).$$

Then, we have

$$(M^e)_{ij} = (e^{\tau_e Q})_{ij} = \begin{cases} 1 - (q-1)\delta_e & \text{if } i = j, \\ \delta_e & \text{o.w.} \end{cases}$$
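The closed form for the transition matrix of the symmetric model can be checked numerically. The sketch below builds the rate matrix with diagonal $-(q-1)/q$ and off-diagonal $1/q$, and compares the closed form against a truncated Taylor series for the matrix exponential. The helper names are hypothetical, and the series is only a convenient stand-in for a proper matrix-exponential routine.

```python
import math


def potts_rate_matrix(q):
    """Rate matrix of the q-state symmetric model (second eigenvalue -1)."""
    return [[-(q - 1) / q if i == j else 1 / q for j in range(q)]
            for i in range(q)]


def potts_transition(q, tau):
    """Closed form: delta off the diagonal, 1 - (q-1)*delta on it."""
    delta = (1 - math.exp(-tau)) / q
    return [[1 - (q - 1) * delta if i == j else delta for j in range(q)]
            for i in range(q)]


def expm_series(Q, tau, terms=60):
    """Truncated Taylor series of exp(tau * Q); adequate for small
    matrices and moderate tau."""
    q = len(Q)
    result = [[float(i == j) for j in range(q)] for i in range(q)]
    term = [row[:] for row in result]
    for n in range(1, terms):
        # term_n = term_{n-1} * Q * tau / n
        term = [[sum(term[i][k] * Q[k][j] for k in range(q)) * tau / n
                 for j in range(q)] for i in range(q)]
        result = [[result[i][j] + term[i][j] for j in range(q)]
                  for i in range(q)]
    return result


P = potts_transition(4, 0.7)
E = expm_series(potts_rate_matrix(4), 0.7)
print(max(abs(P[i][j] - E[i][j]) for i in range(4) for j in range(4)))
```

The printed discrepancy should be at the level of floating-point rounding, which is consistent with $e^{\tau Q} = e^{-\tau} I + (1 - e^{-\tau}) J / q$ for this rate matrix, where $J$ is the all-ones matrix.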



Phylogenetic reconstruction. A standard assumption in molecular evolution is 
that each site in a sequence (DNA, protein, etc.) evolves independently accord- 
ing to a Markov model on a tree, such as the GTR model above. Because of the 
reversibility assumption, the root of the phylogeny cannot be identified and we 
reconstruct phylogenies up to their root. 

Definition 5 (Phylogenetic Reconstruction Problem) Let
$\mathbb{Y} = \{\mathbb{Y}_n\}_{n \ge 1}$ be a subset of phylogenies and
$\mathbb{Q}_q$ be a subset of rate matrices on $q$ states. Let
$\mathcal{T} = (V, E, [n], \rho; \tau) \in \mathbb{Y}$. If
$T = (V, E, [n], \rho)$ is the rooted tree underlying $\mathcal{T}$, we denote
by $T_-[\mathcal{T}]$ the tree $T$ where the root is removed: that is, we
replace the two edges adjacent to the root by a single edge. We denote by
$\mathbb{T}_n$ the set of all leaf-labeled trees on $n$ leaves with internal
degrees 3 and we let $\mathbb{T} = \{\mathbb{T}_n\}_{n \ge 1}$. A phylogenetic
reconstruction algorithm is a collection of maps
$A = \{A_{n,k}\}_{n,k \ge 1}$ from sequences
$(s^i_{[n]})_{i=1}^k \in ([q]^{[n]})^k$ to leaf-labeled trees
$T \in \mathbb{T}_n$. We only consider algorithms $A$ computable in time
polynomial in $n$ and $k$. Let $k(n)$ be an increasing function of $n$. We say
that $A$ solves the phylogenetic reconstruction problem on
$\mathbb{Y} \otimes \mathbb{Q}_q$ with sequence length $k = k(n)$ if for all
$\delta > 0$, there is $n_0 \ge 1$ such that for all $n \ge n_0$,
$\mathcal{T} \in \mathbb{Y}_n$, $Q \in \mathbb{Q}_q$,

$$\mathbb{P}\left[A_{n,k(n)}\left((s^i_{[n]})_{i=1}^{k(n)}\right) = T_-[\mathcal{T}]\right] \ge 1 - \delta,$$

where $(s^i_{[n]})_{i=1}^{k(n)}$ are i.i.d. samples from
$\mathcal{L}_{\mathcal{T},Q}$.
An important result of this kind was given by Erdős et al. [ESSW99]. Let
$\alpha \ge 1$ and $q \ge 2$. The set of rate matrices $Q \in \mathbb{Q}_q$
such that $\mathrm{tr}(Q) \ge -\alpha$ is denoted $\mathbb{Q}_{q,\alpha}$. Let
$0 < f < g < +\infty$ and denote by $\mathbb{Y}^{f,g}$ the set of all
phylogenies $\mathcal{T} = (V, E, [n], \rho; \tau)$ satisfying
$f \le \tau_e \le g$, $\forall e \in E$. Then, Erdős et al. showed (as
rephrased in our setup) that, for all $\alpha \ge q - 1$, $q \ge 2$, and all
$0 < f < g < +\infty$, the phylogenetic reconstruction problem on
$\mathbb{Y}^{f,g} \otimes \mathbb{Q}_{q,\alpha}$ can be solved with
$k = \mathrm{poly}(n)$. (In fact, they proved a more general result allowing
rate matrices to vary across different edges.) In the case of the Potts model,
this result was improved by Daskalakis et al. [DMR09] (building on [Mos04]) in
the Kesten-Stigum (KS) reconstruction phase, that is, when
$g < g_{\mathrm{Lin}}(Q) = g^*_{\mathrm{Lin}} = \ln \sqrt{2}$. They showed
that, for all $0 < f < g < g^*_{\mathrm{Lin}}$, the phylogenetic
reconstruction problem on $\mathbb{Y}^{f,g} \otimes \{Q^{(q)}\}$ can be solved
with $k = O(\log(n))$. More recently, the latter result was extended to GTR
models by Roch [Roc09], building on [Roc08, PR09]. But prior to our work, no
PTR algorithm had been shown to extend beyond $g^*_{\mathrm{Lin}}$.

2.2 Our Results 

Positive result. In our first result, we extend logarithmic reconstruction
results for $q$-state symmetric models to $\ln \sqrt{2} < g < \ln 2$ for large
enough $q$. This is the first result of this type beyond the KS bound.

Theorem 1 (Logarithmic Reconstruction beyond the KS Transition) Let
$0 < f < g < +\infty$ and denote by $\mathbb{HY}^{f,g}$ the set of all
homogeneous phylogenies $\mathcal{T} = (V, E, [n], \rho; \tau)$ satisfying
$f \le \tau_e \le g$, $\forall e \in E$. Let $g_{\mathrm{perc}} = \ln 2$.
Then, for all $0 < f < g < g_{\mathrm{perc}}$, there is $R > 2$ such that for
all $q > R$ the phylogenetic reconstruction problem on
$\mathbb{HY}^{f,g} \otimes \{Q^{(q)}\}$ can be solved with $k = O(\log(n))$.

Theorem 1 can be extended to general phylogenies using the techniques of [DMR09], 
although then one requires discretized branch lengths. See [DMR09] for details. 

Negative result. In our second result, we show that for
$g > g_{\mathrm{ML}}(Q)$ the number of samples $k$ must grow polynomially in
$n$. In particular, this is true for the $q$-state symmetric model for all
$q \ge 2$ and $g > \ln 2$ by the results of [Mos01].

Theorem 2 (Polynomial Lower Bound Above $g_{\mathrm{ML}}(Q)$ (see also
[Mos03, Mos04])) Let $Q \in \mathbb{Q}_q$ and $f = g > g_{\mathrm{ML}}(Q)$.
Then the phylogenetic reconstruction problem on
$\mathbb{HY}^{f,g} \otimes \{Q\}$ requires $k = \Omega(n^{\alpha})$ for some
$\alpha > 0$ (even assuming $Q$ and $g$ are known exactly beforehand).

Remark 1 (Biological Convention) Our normalization of $Q$ differs from standard
biological convention where it is assumed that the total rate of change per
unit time at stationarity is 1, that is,

$$\sum_{i \in [q]} \pi_i Q_{ii} = -1.$$

See e.g. [Fel04]. Let $-\lambda_Q$ denote the largest negative eigenvalue under
this convention. Then, the Kesten-Stigum bound is given by the solution
$\tau = g_{\mathrm{Lin}}(Q)$ to

$$2 e^{-2 \lambda_Q \tau} = 1.$$

For instance, in the Jukes-Cantor model one has
$g_{\mathrm{Lin}}(Q) = \frac{3}{8} \ln 2$.

3 Upper Bound for Large q
3.1 Root Estimator

The basic ingredient behind logarithmic reconstruction results is an accurate
estimator of the root state. In the KS phase, this can be achieved by
majority-type procedures. See [Mos98, EKPS00, Mos04]. In the reconstruction
phase beyond the KS phase however, a more sophisticated estimator is needed.
In this subsection we define an accurate root estimator which does not depend
on the edge lengths.

Random Cluster Methods. We use a convenient percolation representation of
the ferromagnetic Potts model on trees. Let $q \ge 2$ and
$\mathcal{T} = (V, E, [n], \rho; \tau) \in \mathbb{HY}_n$ with corresponding
$(\delta_e)_{e \in E}$. Run a percolation process on $T = (V, E)$ where edge
$e$ is open with probability $1 - q\delta_e$. Then associate to each open
cluster a state according to the uniform distribution on $[q]$. The state so
obtained $(s_v)_{v \in V}$ has the same distribution as the GTR model
$(\mathcal{T}, Q^{(q)})$.

We will use the following definition. Let $T'$ be a subtree of $T$ which is
rooted at $\rho$. We say that $T'$ is an $\ell$-diluted binary tree if, for
all $s$, all the vertices of $T'$ at level $s\ell$ have exactly 2 descendants
at level $(s+1)\ell$. (Assume for now that $\log_2 n$ is a multiple of
$\ell$.) For a state $i \in [q]$ and assignment $s_{[n]}$ at the leaves, we
say that the event $B_{i,\ell}$ holds if there is an $\ell$-diluted binary
tree with state $i$ at all its leaves according to $s_{[n]}$. Let $B_\ell$ be
the set of all $i$ such that $B_{i,\ell}$ holds. Consider the following
estimator: pick a state $X$ uniformly at random in $[q]$ and let

$$\hat{s}_\rho = \begin{cases} X & \text{if } X \in B_\ell, \\ \text{pick uniformly in } [q] - \{X\} & \text{o.w.} \end{cases}$$

We use the following convention. If $\log_2 n$ is not a multiple of $\ell$, we
add levels of 0-length edges to $T$ so as to make the total number of levels
be a multiple of $\ell$ and we copy the states at the leaves of $T$ to all
their descendants in the new tree. We then apply the estimator as above.
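A minimal sketch of this estimator on a complete binary tree may help fix ideas. It assumes the leaf states are listed left to right, that states are numbered from 0, and that "at least two good descendants $\ell$ levels down" is the right way to test for the existence of an $\ell$-diluted subtree; the function names are hypothetical.

```python
import random


def has_diluted_tree(leaf_states, i, ell):
    """Check the event B_{i,ell} on a complete binary tree.

    leaf_states lists the states of the n = 2**h leaves from left to right.
    A vertex is 'good' if at least two of its 2**ell descendants ell levels
    below are good; a leaf is good if it carries state i."""
    n = len(leaf_states)
    h = n.bit_length() - 1
    if n != 1 << h:
        raise ValueError("number of leaves must be a power of two")
    # Padding convention: add zero-length levels until h is a multiple of
    # ell, copying each leaf state to all of its new descendants.
    pad = (-h) % ell
    good = [s == i for s in leaf_states for _ in range(1 << pad)]
    block = 1 << ell
    while len(good) > 1:
        good = [sum(good[j:j + block]) >= 2
                for j in range(0, len(good), block)]
    return good[0]


def estimate_root(leaf_states, q, ell, rng):
    """Guess X uniformly; keep it if X is in B_ell, otherwise answer with
    a uniformly chosen different state."""
    x = rng.randrange(q)
    if has_diluted_tree(leaf_states, x, ell):
        return x
    return rng.choice([s for s in range(q) if s != x])


leaves = [0, 1, 1, 1, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
print([has_diluted_tree(leaves, i, 2) for i in range(3)])
print(estimate_root(leaves, 3, 2, random.Random(0)))
```

The check runs bottom-up in blocks of $2^\ell$ consecutive vertices, so it touches each vertex of the diluted skeleton once and never needs the edge lengths, in line with the remark that the estimator does not depend on them.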

Error Channel. We show next that $\hat{s}_\rho$ is a good estimator of the
root state under the conditions of Theorem 1. Let

$$M^{\rho}_{ij} = \mathbb{P}[\hat{s}_\rho = j \mid s_\rho = i], \qquad i, j \in [q].$$

Proposition 1 shows that this "error channel" is of the Potts type with bounded
length, no matter how deep the tree. The key behind our reconstruction algorithm
in the next section will be to think of this error channel as an "extra edge" in
the Markov model.

Proposition 1 (Root Estimator from Diluted Trees) Let
$g_{\mathrm{perc}} = \ln 2$. Then, for all $0 < g < g_{\mathrm{perc}}$, we can
find $\ell > 0$, $R > 2$ and $0 < b < +\infty$ such that

$$M^{\rho} = e^{b_\rho Q},$$

where $b_\rho \le b$ and $Q = Q^{(q)}$ for all $q > R$ and all
$\mathcal{T} \in \mathbb{HY}^{0,g}$.

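Proof: The proof is based on a random cluster argument of Mossel [Mos01]. Fix
$0 < f < g < g_{\mathrm{perc}}$. In [Mos01], it is shown that one can choose
$\epsilon > 0$ small enough and $\ell$, $R$ large enough such that

$$\mathbb{P}[B_{i,\ell} \mid s_\rho = i] \ge \epsilon, \qquad (1)$$

and

$$\mathbb{P}[B_{i,\ell} \mid s_\rho \ne i] \le \epsilon/2, \qquad (2)$$

for all $q > R$ and all
$\mathcal{T} = (V, E, [n], \rho; \tau) \in \mathbb{HY}^{0,g}$. The proof in
[Mos01] actually assumes that all $\tau_e$'s are equal to $g$. However, the
argument still holds when $\tau_e \le g$ for all $e$ since smaller $\tau$'s
imply smaller $\delta$'s which can only strengthen inequalities (1) and (2) by
a standard domination argument. (For (2), see the original argument in
[Mos01].)

Therefore, we have

$$M^{\rho}_{ii} = \mathbb{P}[i \in B_\ell \mid s_\rho = i]\, \mathbb{P}[X = i] + \frac{1}{q-1}\, \mathbb{P}[X \notin B_\ell \mid s_\rho = i, X \ne i]\, \mathbb{P}[X \ne i] \ge \frac{1}{q} + \frac{\epsilon}{2q}.$$

Also, by symmetry, we have for $i \ne j$

$$M^{\rho}_{ij} = \frac{1}{q-1}\left(1 - M^{\rho}_{ii}\right) \le \frac{1}{q-1}\left(1 - \frac{1}{q} - \frac{\epsilon}{2q}\right) = \frac{1}{q} - \frac{\epsilon}{2q(q-1)}.$$

Hence, the channel $M^{\rho}$ is of the form $e^{b_\rho Q}$ with
$b_\rho \le b$ where, by the relation between $\delta$ and $\tau$ given in
Example 2, we can take

$$b = -\ln\left(\frac{\epsilon}{2(q-1)}\right).$$

This concludes the proof. ■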

3.2 Reconstruction Algorithm 

Our reconstruction algorithm is based on standard distance-based quartet
techniques. Let $\mathcal{T} = (V, E, [n], \rho; \tau) \in \mathbb{HY}^{f,g}$
be a homogeneous phylogeny that we seek to reconstruct from $k$ samples of the
corresponding Potts model at the leaves
$(s^i_{[n]})_{i=1}^k \in ([q]^{[n]})^k$.

Distances. For two nodes $u, v \in V$, we may relate their distance to the
probability that their states agree

$$\tau(u, v) = -\ln\left(1 - \frac{q}{q-1}\, \mathbb{P}[s_u \ne s_v]\right),$$

and so a natural way to estimate $\tau(u, v)$ is to consider the estimator

$$\hat{\tau}(u, v) = -\ln\left(1 - \frac{q}{q-1} \cdot \frac{1}{k} \sum_{i=1}^k \mathbb{1}\{s^i_u \ne s^i_v\}\right).$$

Of course, given samples at the leaves, this estimator can only be used for
$u, v \in [n]$. Instead, when $u, v$ are internal nodes we first reconstruct
their sequence using Proposition 1. We will then over-estimate the true
distance by an amount not exceeding $2b$ on average. For $u, v \in V - [n]$,
let

$$\tau_b(u, v) = \tau(u, v) + b_u + b_v,$$

using the notation of Proposition 1. We also let $(\hat{s}^i_u)_{i=1}^k$,
$(\hat{s}^i_v)_{i=1}^k$ be the reconstructed states at $u, v$. By convention
we let

$$\tau_b(a, b) = \tau(a, b),$$

and

$$\hat{s}^i_a = s^i_a, \quad \forall i = 1, \ldots, k,$$

for $a, b \in [n]$. Note that, at the beginning of the algorithm, the phylogeny
is not known, making it impossible to compute $(\hat{s}^i_u)_{i=1}^k$ for
internal nodes. However as we reconstruct parts of the tree we will
progressively compute the estimated sequences of uncovered internal nodes.

By standard concentration inequalities, $\tau_b(u, v)$ can be well approximated
with $k = O(\log n)$ as long as $\tau_b(u, v) = O(1)$. For $u, v \in V$ let

$$\hat{\tau}(u, v) = -\ln\left(1 - \frac{q}{q-1} \cdot \frac{1}{k} \sum_{i=1}^k \mathbb{1}\{\hat{s}^i_u \ne \hat{s}^i_v\}\right).$$

Recall the notation of Example 1.
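The distance estimator for the $q$-state symmetric model can be sketched as follows. It inverts the agreement probability $1 - (q-1)\delta$ from Example 2; the function name `estimate_distance` and the choice to return infinity when the empirical mismatch rate saturates are assumptions made for this illustration.

```python
import math


def estimate_distance(seq_u, seq_v, q):
    """Plug-in distance estimate for the q-state symmetric model.

    seq_u, seq_v: equal-length sequences of states observed (or
    reconstructed) at two vertices, one entry per site."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be non-empty and of equal length")
    k = len(seq_u)
    mismatch = sum(a != b for a, b in zip(seq_u, seq_v)) / k
    arg = 1.0 - q / (q - 1) * mismatch
    if arg <= 0.0:
        # The mismatch rate is at or beyond its saturation value (q-1)/q.
        return math.inf
    return -math.log(arg)


# 295 mismatches out of 1000 sites is close to what tau = 0.5 predicts
# for q = 4, namely (3/4) * (1 - exp(-0.5)).
u = [0] * 1000
v = [1] * 295 + [0] * 705
print(estimate_distance(u, v, 4))
```

Because the logarithm is steep near saturation, the estimate is only reliable for distances of order 1, which is the reason a diameter test accompanies it in the lemmas that follow.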

Lemma 1 (Distorted Metric: Short Distances [ESSW99]) Let $0 \le h' < h$ and
let $u, v \in L^{(h)}_{h'}$ be distinct leaves. For all $D > 0$, $\delta > 0$,
$\gamma > 0$, there exists $c = c(D, \delta, \gamma) > 0$, such that if the
following conditions hold:

• [Small Diameter] $\tau_b(u, v) < D$,

• [Sequence Length] $k = c' \log n$ for $c' > c$,

then

$$|\tau_b(u, v) - \hat{\tau}(u, v)| < \delta,$$

with probability at least $1 - n^{-\gamma}$.

Lemma 2 (Distorted Metric: Diameter Test [ESSW99]) Let $0 \le h' < h$ and
$u, v \in L^{(h)}_{h'}$. For all $D > 0$, $W > 5$, $\gamma > 0$, there exists
$c = c(D, W, \gamma) > 0$, such that if the following conditions hold:

• [Large Diameter] $\tau_b(u, v) > D + \ln W$,

• [Sequence Length] $k = c' \log n$ for $c' > c$,

then

$$\hat{\tau}(u, v) > D + \ln \frac{W}{2},$$

with probability at least $1 - n^{-\gamma}$. On the other hand, if the first
condition above is replaced by

• [Small Diameter] $\tau_b(u, v) < D + \ln \frac{W}{5}$,

then

$$\hat{\tau}(u, v) \le D + \ln \frac{W}{2},$$

with probability at least $1 - n^{-\gamma}$.

Quartet Tests. Let $0 \le h' < h$ and
$Q_0 = \{a_0, b_0, c_0, d_0\} \subseteq L^{(h)}_{h'}$. The topology of
$T^{(h)}$ restricted to $Q_0$ is completely characterized by a bi-partition or
quartet split $q_0$ of the form: $a_0 b_0 | c_0 d_0$, $a_0 c_0 | b_0 d_0$ or
$a_0 d_0 | b_0 c_0$. The most basic operation in quartet-based reconstruction
algorithms is the inference of such quartet splits. In distance-based methods
in particular, this is usually done by performing the so-called four-point
test: letting

$$\mathcal{F}(a_0 b_0 | c_0 d_0) = \frac{1}{2}\left[\tau(a_0, c_0) + \tau(b_0, d_0) - \tau(a_0, b_0) - \tau(c_0, d_0)\right],$$

we have

$$q_0 = \begin{cases} a_0 b_0 | c_0 d_0 & \text{if } \mathcal{F}(a_0, b_0 | c_0, d_0) > 0, \\ a_0 c_0 | b_0 d_0 & \text{if } \mathcal{F}(a_0, b_0 | c_0, d_0) < 0, \\ a_0 d_0 | b_0 c_0 & \text{o.w.} \end{cases}$$

Note that adding "extra edges" at the nodes $a_0, b_0, c_0, d_0$ as implied in
Proposition 1 does not affect the topology of the quartet.

Since Lemma 1 applies only to short distances, we also perform a diameter test.
We let $\hat{\mathcal{F}}(a_0 b_0 | c_0 d_0) = +\infty$ if
$\max_{u, v \in Q_0} \hat{\tau}(u, v) > D + \ln \frac{W}{2}$ and otherwise

$$\hat{\mathcal{F}}(a_0 b_0 | c_0 d_0) = \frac{1}{2}\left[\hat{\tau}(a_0, c_0) + \hat{\tau}(b_0, d_0) - \hat{\tau}(a_0, b_0) - \hat{\tau}(c_0, d_0)\right].$$

Finally we let

$$\mathrm{FP}(a_0, b_0 | c_0, d_0) = \mathbb{1}\{\hat{\mathcal{F}}(a_0 b_0 | c_0 d_0) > f/2\}.$$
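The four-point statistic and its thresholded version can be sketched as below. One point is an assumption rather than a transcription of the text: this sketch treats a quartet that fails the diameter check as rejected (the test returns `False`), which is a conservative reading of how the diameter test is meant to guard the short-distance lemma. The names `four_point_statistic`, `fp_test`, and the `cap` parameter are hypothetical.

```python
from itertools import combinations


def four_point_statistic(dist, a, b, c, d):
    """Half of [dist(a,c) + dist(b,d) - dist(a,b) - dist(c,d)].

    For an exact tree metric this is the internal edge length when the
    split is ab|cd, its negative when the split is ac|bd, and zero when
    the split is ad|bc."""
    return 0.5 * (dist(a, c) + dist(b, d) - dist(a, b) - dist(c, d))


def fp_test(dist, a, b, c, d, f, cap):
    """Accept the split ab|cd if all six pairwise distances are at most
    cap and the statistic exceeds f / 2 (f is the minimum edge length)."""
    if any(dist(u, v) > cap for u, v in combinations((a, b, c, d), 2)):
        return False  # assumption: a failed diameter check rejects the quartet
    return four_point_statistic(dist, a, b, c, d) > f / 2


# Exact metric of the quartet tree ab|cd with all five edges of length 1.
D = {frozenset("ab"): 2.0, frozenset("cd"): 2.0,
     frozenset("ac"): 3.0, frozenset("ad"): 3.0,
     frozenset("bc"): 3.0, frozenset("bd"): 3.0}


def dist(u, v):
    return D[frozenset((u, v))]


print(four_point_statistic(dist, "a", "b", "c", "d"))
print(fp_test(dist, "a", "b", "c", "d", f=1.0, cap=5.0))
print(fp_test(dist, "a", "c", "b", "d", f=1.0, cap=5.0))
```

In the algorithm the exact metric would be replaced by the estimated distances, and `cap` would play the role of $D + \ln \frac{W}{2}$.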

Algorithm. The algorithm is detailed in Figure 1. The proof of its correctness is
left to the reader. This concludes the proof of Theorem 1. 

4 General Lower Bound 

Here we prove the following statement which implies Theorem 2: 
Algorithm

Input: Sequences $(s^i_{[n]})_{i=1}^k \in ([q]^{[n]})^k$;
Output: Tree;

• Let $Z_0$ be the set of leaves.

• For $h' = 0, \ldots, h - 1$,

1. Four-Point Test. Let

$$\mathcal{R}_{h'} = \{q = ab|cd : \forall a, b, c, d \in Z_{h'} \text{ distinct such that } \mathrm{FP}(q) = 1\}.$$

2. Cherries. Identify the cherries in $\mathcal{R}_{h'}$, that is, those pairs
of vertices that only appear on the same side of the quartet splits in
$\mathcal{R}_{h'}$. Let

$$Z_{h'+1} = \{a_1^{(h'+1)}, \ldots, a_{2^{h-(h'+1)}}^{(h'+1)}\}$$

be the parents of the cherries in $Z_{h'}$.

3. Reconstructed Sequences. For all $u \in Z_{h'+1}$, compute
$(\hat{s}^i_u)_{i=1}^k$.

Figure 1: Algorithm.
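Step 2 of Figure 1 (cherry identification) can be sketched on its own. The code below assumes that a quartet split is represented as a pair of pairs `((a, b), (c, d))`, and it uses an exact tree metric as a noiseless stand-in for the four-point test of step 1; both the representation and the helper names are hypothetical.

```python
from itertools import combinations


def splits_from_metric(dist, vertices):
    """Noiseless stand-in for step 1: for every quartet, keep the pairing
    with the smallest sum of within-pair distances."""
    splits = []
    for a, b, c, d in combinations(vertices, 4):
        pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
        splits.append(min(pairings,
                          key=lambda s: dist(*s[0]) + dist(*s[1])))
    return splits


def find_cherries(splits):
    """Return the pairs that appear on the same side of at least one split
    and never on opposite sides of any split."""
    together, apart = set(), set()
    for (a, b), (c, d) in splits:
        together.add(frozenset((a, b)))
        together.add(frozenset((c, d)))
        for x in (a, b):
            for y in (c, d):
                apart.add(frozenset((x, y)))
    return sorted(tuple(sorted(pair)) for pair in together - apart)


def balanced_dist(i, j):
    """Path length between leaves i and j of a complete binary tree with
    unit edges, leaves numbered left to right from 0."""
    return 2 * (i ^ j).bit_length()


print(find_cherries(splits_from_metric(balanced_dist, range(8))))
```

On the 8-leaf balanced tree this recovers the four sibling pairs; their parents would form the next level $Z_{h'+1}$, whose sequences are then reconstructed in step 3.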

Theorem 3 (Polynomial Lower Bound on PTR) Consider the phylogenetic
reconstruction problem for homogeneous trees with fixed edge length
$\tau(e) = \tau > 0$ for all edges $e \in E$. Assume further that the ASR
problem for edge length $\tau$ and matrix $Q$ is not solvable and that moreover
$\tau > g_{\mathrm{Lin}}$. Then there exists $\alpha = \alpha(\tau) > 0$ such
that the probability of correctly reconstructing the tree is at most
$O(n^{-\alpha})$ assuming $k \le n^{\alpha}$.

For general mutation rates $Q$, it is not known if there is a unique
reconstruction threshold $g_{\mathrm{ML}}(Q)$ such that ASR is possible for
$\tau < g_{\mathrm{ML}}(Q)$ and impossible for $\tau > g_{\mathrm{ML}}(Q)$.
For models for which such a threshold exists Theorem 3 above shows the
impossibility of phylogenetic reconstruction for $\tau > g_{\mathrm{ML}}(Q)$.
The existence of the threshold $g_{\mathrm{ML}}(Q)$ has been established for a
few models, e.g. for so-called random cluster models, which include the binary
asymmetric channel and the Potts model [Mos01].

The proof of Theorem 3 is based on the following two lemmas. It is useful to
write $n = 2^{\ell}$ for the number of leaves of a homogeneous tree with
$\ell$ levels.

Lemma 3 (Reconstructing a Deep Subtree) Consider the PTR problem for
homogeneous trees with fixed edge length $\tau$. Let $\mu^i_\ell$ denote the
distribution at the leaves on a homogeneous $\ell$-level tree with fixed edge
length $\tau$, root value $i$, and rate matrix $Q$. Suppose there exists a
number $0 < \alpha < 1$ such that for every $\ell$ and all $i$ one can write
$\mu^i_\ell = (1 - \epsilon)\mu_\ell + \epsilon \mu'^i_\ell$ for some
probability measures $\mu'^i_\ell$, $i \in [q]$, $\mu_\ell$, and
$\epsilon = O(2^{-\alpha \ell})$. Then the probability of correctly
reconstructing homogeneous phylogenetic trees with edge length $\tau$ assuming
$k \le n^{\alpha/10}$ is at most $O(n^{-8\alpha/10})$.

Lemma 4 (Leaf Distribution Decomposition) Consider the ASR problem for
homogeneous trees with fixed edge length $\tau$. Assume further that the ASR
problem for $Q$ with edge length $\tau$ is not solvable and further
$\tau > g_{\mathrm{Lin}}$. Then there is an $\alpha = \alpha(\tau) > 0$ for
which the following holds. There exists a sequence
$\epsilon_\ell = O(2^{-\alpha \ell})$ such that for all $i \in [q]$ one can
write $\mu^i_\ell = (1 - \epsilon_\ell)\mu_\ell + \epsilon_\ell \mu'^i_\ell$
for some probability measures $\mu'^i_\ell$, $i \in [q]$ and $\mu_\ell$.

Proof of Lemma 3: Let $r$ be chosen so that
$2^{r-1} < n^{\alpha/10} \le 2^{r}$. (Note that $r < \ell$.) Consider the
following distribution: first, pick a homogeneous tree $T$ on $\ell$ levels,
where the first $r$ levels are chosen uniformly at random among $r$-level
homogeneous trees and the remaining levels are fixed (i.e., deterministic);
second, pick $k$ samples of a Markov model with rate matrix $Q$ and fixed edge
length $\tau$ on the resulting tree.

Let $A$ be a phylogenetic reconstruction algorithm. Our goal is to bound the
success probability of $A$ on the random model above. We may assume that the
bottom $\ell - r$ levels are given to $A$ (as it may ignore this information)
and that $A$ is deterministic (as a simple convexity argument shows that
deterministic algorithms achieve the highest success probability).

Note that the assumption of the lemma implies that, for a single sample, we
can simultaneously couple the distribution at the leaves of all the given
subtrees of $\ell - r$ levels, except with probability
$O(2^{r} 2^{-\alpha(\ell - r)}) = O(n^{-9\alpha/10})$. This can be achieved by
starting the coupling at level $r$ (from the root) of the tree. Repeating this
for the $n^{\alpha/10}$ samples we obtain the following. Let $\mu_T$ denote
the measure on the $n^{\alpha/10}$ samples at leaves of $T$. Then there exist
measures $\mu$, $\mu'_T$ and $\epsilon = O(n^{-8\alpha/10})$ such that
$\mu_T = (1 - \epsilon)\mu + \epsilon \mu'_T$.

Write $N_r$ for the number of leaf-labelled complete binary trees on $r$
levels. Write $\mathcal{E}(s, A, T)$ for the indicator of the event that the
$k$ samples are given by $s$ and that $A$ recovers $T$. The success
probability of $A$ is then given by

$$N_r^{-1} \sum_{T} \sum_{s} \mu_T(\mathcal{E}(s, A, T)) = (1 - \epsilon) N_r^{-1} \sum_{s} \sum_{T} \mu(\mathcal{E}(s, A, T)) + \epsilon N_r^{-1} \sum_{T} \sum_{s} \mu'_T(\mathcal{E}(s, A, T)). \qquad (3)$$

For the second term note that

$$\sum_{s} \mu'_T(\mathcal{E}(s, A, T)) \le \sum_{s} \mu'_T(s) = 1,$$

and therefore the second term in (3) is bounded by $\epsilon$. Furthermore for
each $s$, $\sum_{T} \mu(\mathcal{E}(s, A, T)) = \mu(s)$ by definition and
$\sum_{s} \mu(s) = 1$ so the first term in (3) is bounded by
$(1 - \epsilon) N_r^{-1}$.

Thus overall, the bound on the probability of correct reconstruction is
$\epsilon + (1 - \epsilon) N_r^{-1}$. Using the facts that
$N_r = \Omega(2^{2^{r}}) = \Omega(2^{n^{\alpha/10}})$ and
$\epsilon = O(n^{-8\alpha/10})$ concludes the proof. ■

Proof of Lemma 4: For $\delta > 0$ and $r' \ge 0$, let
$\mu^i_{\ell - r'}(\delta)$ be the same measure as $\mu^i_{\ell - r'}$, except
that, for each leaf, independently with probability $1 - \delta$, the state at
the leaf is replaced by $*$ (which does not belong to the original alphabet).
The key to the proof is the main result of [JM04] where it is shown that if
$\tau > g_{\mathrm{Lin}}$ then the following holds: There exist fixed
$\delta > 0$, $\alpha > 0$ such that

$$\mu^i_{\ell - r'}(\delta) = (1 - \epsilon)\mu_{\ell - r'}(\delta) + \epsilon \mu'^i_{\ell - r'}(\delta), \qquad (4)$$

where $\epsilon = O(2^{-\alpha(\ell - r')})$ for some probability measures
$\mu'^i_{\ell - r'}(\delta)$ and $\mu_{\ell - r'}(\delta)$.

The fact that there is no reconstruction (ASR) at edge length $\tau$ implies
that there exists a fixed $r'$ and measures $\nu$ and $\nu'^i$ such that

$$\mu^i_{r'} = (1 - \delta)\nu + \delta \nu'^i.$$

This implies in particular that we can simulate the mutation process on an
$\ell$-level tree by first using the measure $\mu^i_{\ell - r'}(\delta)$ and
then applying the following rule: for each node $v$ at level $\ell - r'$
independently

• If the label at $v$ is $*$ then generate the leaf states on the subtree
rooted at $v$ according to the measure $\nu$.

• Else if it is labeled by $i$, sample leaf states on the subtree below $v$
from the measure $\nu'^i$.

The desired property of the measures $\mu^i_\ell$ now follows from the fact
that the measures $\mu^i_{\ell - r'}(\delta)$ have the desired property by
(4). ■